refactor!: update operations to use delta scan #1639
Conversation
Can be merged after #1705
@wjones127 This is ready to be merged
Looking great so far! Running out of time for now, but will finish asap!
Co-authored-by: Robert Pack <[email protected]>
LGTM! operations are really coming along :).
* feat: extend unit catalog support * chore: draft datafusion integration * fix: allow passing catalog options from python * chore: clippy * feat: add more azure credentials * fix: add defaults for return types * fix: simpler defaults * Update rust/src/data_catalog/unity/mod.rs Co-authored-by: nohajc <[email protected]> * fix: imports * fix: add some defaults * test: add failing provider test * feat: list catalogs * merge main * fix: remove artifact * fix: errors after merge with main * Start python api docs * docs: update Readme (delta-io#1440) # Description With summit coming up I thought we might update our README, since delta-rs has evolved quite a bit since the README was first written... Just opening the Draft to get feedback on the general "patterns" i.e. how the tables are formatted, how detailed we want to show the features and mostly the looks of the header. Also hoping our community experts may have some content they wat to add here 😆. cc @dennyglee @MrPowers @wjones127 @rtyler @houqp @fvaleye --------- Co-authored-by: Will Jones <[email protected]> Co-authored-by: R. Tyler Croy <[email protected]> * Pin chrono to 0.4.30 v0.4.31 was just released which introduces some spurious deprecation warnings * docs: update Readme (delta-io#1633) # Description - Changed the icons as, at first glance, it looked like AWS was not supported (in blue), while the green open icon looked like it was completed - Added one line linking to the Delta Lake docker - Fixed some minor grammar issues Including community experts @roeap @MrPowers @wjones127 @rtyler @houqp @fvaleye to ensure these updates make sense. Thanks! * chore: update datafusion to 31, arrow to 46 and object_store to 0.7 (delta-io#1634) # Description Update datafusion to 31 * chore: relax chrono pin to 0.4 (delta-io#1635) # Description relax chrono pin to improve downstream compatibility. 
* make create_checkpoint_for public * add documentation to create_checkpoint_for * Implement parsing for the new `domainMetadata` actions in the commit log The Delta Lake protocol which will be released in conjunction with "3.0.0" (currently at RC1) introduces `domainMetadata` actions to the commit log to enable system or user-provided metadata about the commits to be added to the log. With DBR 13.3 in the Databricks ecosystem, tables are already being written with this action via the "liquid clustering" feature. This change enables the clean reading of these tables, but at present nothing novel is done with this information. [Read more here](https://www.databricks.com/blog/announcing-delta-lake-30-new-universal-format-and-liquid-clustering) Fixes delta-io#1626 Sponsored-by: Databricks Inc * fix: include in-progress row group when calculating in-memory buffer length (delta-io#1638) # Description `PartitionWriter.buffer_len()` is documented as returning: > the current byte length of the in memory buffer. However, this doesn't currently include the length of the in-progress row group. This means that until a row group is flushed, `buffer_len()` returns `0`. Based on the documented description, its length should probably include the bytes currently in-memory as part of an unflushed row group. `buffered_record_batch_count` _does_ include in-progress row groups, so this change also means record count and buffered bytes are reported consistently. # Related Issue(s) <!--- For example: - closes delta-io#106 ---> - closes delta-io#1637 # Documentation <!--- Share links to useful documentation ---> [`buffer_len` on `RecordBatchWriter`](https://docs.rs/deltalake/0.15.0/deltalake/writer/record_batch/struct.RecordBatchWriter.html#method.buffer_len) --------- Co-authored-by: Will Jones <[email protected]> * feat: allow multiple incremental commits in optimize Currently "optimize" executes the whole plan in one commit, which might fail. 
The larger the table, the more likely it is to fail and the more expensive the failure is. Add an option in OptimizeBuilder that allows specifying a commit interval. If that is provided, the plan executor will periodically commit the accumulated actions. * fix: explicitly require chrono 0.4.31 or greater The Python binding relies on `timestamp_nanos)opt()` which requires 0.4.31 or greater from chroni since it did not previously exist. As a [cargo dependency refresher](https://doc.rust-lang.org/cargo/reference/specifying-dependencies.html#specifying-dependencies-from-cratesio) this version range is >=0.4.31, < 0.5.0 which is I believe what we need for optimal downstream compatibility. * Correct some merge related errors with redundant package names from the workspace * Address some latent clippy failures after merging main * Correct the incorrect documentation for `Backoff` * fix: avoid excess listing of log files * feat: pass known file sizes to filesystem in Python (delta-io#1630) # Description Currently the Filesystem implementation always makes a HEAD request when opening a file, to determine the file size. The proposed change is to read the file sizes from the delta log instead, and to pass them down to the `open_input_file` call, eliminating the HEAD request. # Related Issue(s) <!--- For example: - closes delta-io#106 ---> # Documentation <!--- Share links to useful documentation ---> * Proposed updated CODEOWNERS to allow better review notifications Based on current pull request feedback and maintenance trends I'm suggesting these rules to get the right people on the reviews by default. Closes delta-io#1553 * fix: add support for Microsoft OneLake This change introduces tests and support for Microsoft OneLake. This specific commit is a rebase of the work done by our pals at Microsoft. 
Co-authored-by: Mohammed Muddassir <[email protected]> Co-authored-by: Christopher Watford <[email protected]> * Ignore failing integration tests which require a special environment to operate The OneLake support should be considered unsupported and experimental until such time when we can add integration testing to our CI process * Compensate for invalid log files created by Delta Live Tables It would appear that in some cases Delta Live Tables will create a Delta table which does not adhere to the Delta Table protocol. The metaData action as a **required** `schemaString` property which simply doesn't exist. Since it appears that this only exists at version zero of the transaction log, and the _actual_ schema exists in the following versions of the table (e.g. 1), this change introduces a default deserializer on the MetaData action which provides a simple empty schema. This is an alternative implementation to delta-io#1305 which is a bit more invasive and makes our schema_string struct member `Option<String>` which I do not believe is worth it for this unfortunate compatibility issue Closes delta-io#1305, delta-io#1302, delta-io#1357 Sponsored-by: Databricks Inc * chore: fix the incorrect Slack link in our readme not sure what the deal with the go.delta.io service, no idea where that lives Fixes delta-io#1636 * enable offset listing for s3 * Make docs.rs build docs with all features enabled I was confused that I could not find the documentation integrating datafusion with delta-rs. With this PR, everything should show up. Perhaps docs for a feature gated method should also mention which feature is required. Similar to what Tokio does. Perhaps it could be done in followup PRs. * feat: expose min_commit_interval to `optimize.compact` and `optimize.z_order` (delta-io#1645) # Description Exposes min_commit_interval in the Python API to `optimize.compact` and `optimize.z_order`. Added one test-case to verify the min_commit_interval. 
# Related Issue(s) closes delta-io#1640 --------- Co-authored-by: Will Jones <[email protected]> * docs: add docstring to protocol method (delta-io#1660) * fix: percent encoding of partition values and paths * feat: handle path encoding in serde and encode partition values in file names * fix: always unquote partition values extracted from path * test: add tests for related issues * fix: consistent serialization of partition values * fix: rounbdtrip special characters * chore: format * fix: add feature requirement to load example * test: add timestamp col to partitioned roundtrip tests * test: add rust roundtip test for special characters * fix: encode characters illegal on windows * docs: fix some typos (delta-io#1662) # Description Saw two typos and marking merge in rust as half-done with a comment on it's current limitation. * feat: use url parsing from object store * fix: ensure config for ms fabric * chore: drive-by simplify test files * fix: update aws http config key * fix: feature gate azure update * feat: more robust azure config handling * fix: in memory store handling * feat: use object-store's s3 store if copy-if-not-exists headers are specified (delta-io#1356) * refactor: re-organize top level modules (delta-io#1434) # Description ~This contains changes from delta-io#1432, will rebase once that's merged.~ This PR constitutes the bulk of re-organising our top level modules. - move `DeltaTable*` structs into new `table` module - move table configuration into `table` module - move schema related modules into `schema` module - rename `action` module to `protocol` - hoping to isolate everything that can one day be the log kernel. ~It also removes the deprecated commit logic from `DeltaTable` and updates call sites and tests accordingly.~ I am planning one more follow up, where I hope to make `transactions` currently within `operations` a top level module. 
While the number of touched files here is already massive, I want to do this in a follow up, as it will also include some updates to the transactions itself, that should be more carefully reviewed. # Related Issue(s) closes: delta-io#1136 # Documentation <!--- Share links to useful documentation ---> * chore: increment python library version (delta-io#1664) # Description The description of the main changes of your pull request # Related Issue(s) <!--- For example: - closes delta-io#106 ---> # Documentation <!--- Share links to useful documentation ---> * fix exception string in writer.py The exception message is ambiguous as it interchanges the table and data schemas. * Update docs * add read me * Add space * feat: allow to set large dtypes for the schema check in `write_deltalake` (delta-io#1668) # Description Currently it was always checking the schema for non-large types, I didn't know before we could change it so in polars we added some schema casting from large to non-large, this however became a problem today when I wanted to write 200M records at once because the array was too big the fit in normal string type. ```python ArrowInvalid: Failed casting from large_string to string: input array too large ``` Adding this flag will allow libraries like polars to write directly with their large dtypes in arrow. If this is merged, I can work on fix in polars to remove the schema casting for these large types. * fix: change partitioning schema from large to normal string for pyarrow<12 (delta-io#1671) # Description If pyarrow is below v12.0.0 it changes the partitioning schema fields from large_string to string. # Related Issue(s) closes delta-io#1669 # Documentation apache/arrow#34546 (comment) --------- Co-authored-by: Will Jones <[email protected]> * chore: bump rust crate version * fix: use epoch instead of ce for date stats (delta-io#1672) # Description date32 statistics logic was subjectively wrong. 
It was using `from_num_days_from_ce_opt` which > Makes a new NaiveDate from a day's number in the proleptic Gregorian calendar, with January 1, 1 being day 1. while date32 is commonly represented as days since UNIX epoch (1970-01-01) # Related Issue(s) closes delta-io#1670 # Documentation It doesn't seem like parquet actually has a spec for what a `date` should be, but many other tools seem to use the epoch logic. duckdb, and polars seem to use epoch instead of gregorian. Also arrow spec states that date32 should be epoch. for example, if i write using polars ```py import polars as pl # %% df = pl.DataFrame( { "a": [ 10561, 9200, 9201, 9202, 9203, 9204, 9205, 9206, 9207, 9208, 9199, ] } ) # %% df.select(pl.col("a").cast(pl.Date)).write_delta("./db/polars/") ``` the stats are correctly interpreted ``` {"add":{"path":"0-7b8f11ab-a259-4673-be06-9deedeec34ff-0.parquet","size":557,"partitionValues":{},"modificationTime":1695779554372,"dataChange":true,"stats":"{\"numRecords\": 11, \"minValues\": {\"a\": \"1995-03-10\"}, \"maxValues\": {\"a\": \"1998-12-01\"}, \"nullCount\": {\"a\": 0}}"}} ``` * chore: update changelog for the rust-v0.16.0 release * Remove redundant changelog entry for 0.16 * update readme * fix: update the delta-inspect CLI to be build again by Cargo This sort of withered on the vine a bit, this pull request allows it to be built properly again * update readme * chore: bump the version of the Rust crate * fix: unify environment variables referenced by Databricks docs Long-term fix will be for Databricks to release a Rust SDK for Unity 😄 Fixes delta-io#1627 * feat: support CREATE OR REPLACE * docs: get docs.rs configured correctly again (delta-io#1693) # Description The docs build was changed in delta-io#1658 to compile on docs.rs with all features, but our crate cannot compile with all-features due to the TLS features, which are mutually exclusive. 
# Related Issue(s) For example: - closes delta-io#1692 This has been tested locally with the following command: ``` cargo doc --features azure,datafusion,datafusion,gcs,glue,json,python,s3,unity-experimental ``` * fix!: ensure predicates are parsable (delta-io#1690) # Description Resolves two issues that impact Datafusion implemented operators 1. When a user has an expression with a scalar built-in scalar function we are unable parse the output predicate since the `DummyContextProvider`'s methods are unimplemented. The provider now uses the user provided state or a default. More work is required in the future to allow a user provided Datafusion state to be used during the conflict checker. 2. The string representation was not parsable by sqlparser since it was not valid SQL. New code was written to transform an expression into a parsable sql string. Current implementation is not exhaustive however common use cases are covered. The delta_datafusion.rs file is getting large so I transformed it into a module. This implementation makes reuse of some code from Datafusion. I've added the Apache License at the top of the file. Let me know if any else is required to be compliant. # Related Issue(s) - closes delta-io#1625 --------- Co-authored-by: Will Jones <[email protected]> * fix typo in readme * fix: address formatting errors * fix: remove an unused import * feat(python): expose delete operation (delta-io#1687) # Description Naively expose the delete operation, with the option to provide a predicate. I first tried to expose a richer API with the Python `FilterType` and DNF expressions, but from what I understand delta-rs doesn't implement generic filters but only `PartitionFilter`. The `DeleteBuilder` also only accepts datafusion expressions. So Instead of hacking my way around or proposing a refactor I went for the simpler approach of sending a string predicate to the rust lib. If this implementation is OK I will add tests. 
# Related Issue(s) - closes delta-io#1417 --------- Co-authored-by: Will Jones <[email protected]> * docs(python): document the delete operation * Introduce some redundant type definitions to the mypy stub * chore: fix new clippy lints introduced in Rust 1.73 * Update the sphinx ignore for building =_= * Enable prebuffer * implement issue 1169 * fix format * feat: add version number in `.history()` and display in reversed chronological order (delta-io#1710) # Description Adds the version number to each commit info. # Related Issue(s) <!--- For example: - closes delta-io#106 ---> - Closes delta-io#1561 - Closes delta-io#1680 --------- Co-authored-by: R. Tyler Croy <[email protected]> * feat(python): expose UPDATE operation (delta-io#1694) # Description - Exposes UPDATE operation to Python. - Added two test cases, with predicate and without - Took some learnings in simplifying the code (will apply it in MERGE PR as well) # Related Issue(s) <!--- For example: - closes delta-io#106 ---> Closes delta-io#1505 --------- Co-authored-by: Will Jones <[email protected]> * fix: merge operation with string predicates (delta-io#1705) # Description Fixes an issue when users use string predicates with the merge operation. Parsing a string predicate did not properly handle table references and would always assume a bare table with a table name of the empty string. Now the qualifier is `None` however a `DFSchema` with qualifiers can be supplied where it makes sense. Now users must provide source and target aliases whenever both sides share a column name otherwise the operation will error out. Minor refactoring of the expression parser was also done and allowed using of case expressions. 
# Related Issue(s) - closes delta-io#1699 --------- Co-authored-by: Will Jones <[email protected]> * refactor!: remove a layer of lifetimes from PartitionFilter (delta-io#1725) # Description This commit removes a bunch of lifetime restrictions on the `PartitionFilter` and `PartitionFilterValue` classes to make them easier to use. While the original discussion in Slack and delta-io#1501 made mention of using a reference type, there doesn't seem to a need for it. A particular instance of a `PartitionFilter` is created once and just borrowed and read for the remainder of its life. Functions, when necessary continue to accept the non-container types (i.e, `&str` and `&[&str]`), allowing their containerized counterparts to continue working with them without needing to borrow or clone the containers (i.e, `String` and `Vec<String>`). # Related Issue(s) - resolves delta-io#1501 # Documentation * feat(python): expose MERGE operation (delta-io#1685) # Description This exposes MERGE commands to the Python API. The updates and predicates are first kept in the Class TableMerger and only dispatched to Rust after `TableMerge.execute()`. This was my first thought on how to implement it since I have limited experience with Rust and PyO3 (still learning 😄). Maybe a more elegant solution is that every class method on TableMerger is dispatched to Rust and then the Rust MergeBuilder gets serialized and sent back to Python (back and forth). Let me know your thoughts on this. If this is better, I could also do this in the next PR, so we at least can push this one out sooner. Couple of issues at the moment, I need feedback on, where the first one is blocking since I can't test it now: ~- Source_alias is not applying, somehow during a schema check the prefix is missing, however when I printed the lines inside merge, it showed the prefix correctly. 
So not sure where the issue is~ ~- I had to make datafusion_utils public since I needed to get the Expression Struct from it, is this the right way to do that? @Blajda~ Edit: I will pull @Blajda's changes delta-io#1705 once merged with develop: # Related Issue(s) <!--- For example: - closes delta-io#106 ---> closes delta-io#1357 * chore: remove deprecated functions * chore: bump the python package version (delta-io#1734) # Description The description of the main changes of your pull request # Related Issue(s) <!--- For example: - closes delta-io#106 ---> # Documentation <!--- Share links to useful documentation ---> * fix: reorder encode_partition_value() checks and add tests (delta-io#1733) # Description The `isinstance(val, datetime)` check was after `isinstance(val, date)` which meant that it was never found. I added a test for each encoding type. --------- Co-authored-by: Robert Pack <[email protected]> * Relax `pyarrow` pin * fix: remove `pandas` pin (delta-io#1746) # Description Removes the `pandas` pin. # Related Issue(s) Resolves delta-io#1745 * docs: get docs.rs configured correctly again (delta-io#1693) # Description The docs build was changed in delta-io#1658 to compile on docs.rs with all features, but our crate cannot compile with all-features due to the TLS features, which are mutually exclusive. # Related Issue(s) For example: - closes delta-io#1692 This has been tested locally with the following command: ``` cargo doc --features azure,datafusion,datafusion,gcs,glue,json,python,s3,unity-experimental ``` * Make this a patch release to fix docs.rs * Remove the hdfs feature from the docsrs build * refactor!: update operations to use delta scan (delta-io#1639) # Description Recently implemented operations did not use `DeltaScan` it had some gaps. These gaps would make it harder switch towards logical plans which is required for merge. 
Gaps: - It was not possible to include file lineage in the result - The subset of files to be scanned is known ahead of time. Users had to reconstruct a parquet scan based on those files The PR introduces a `DeltaScanBuilder` that allow users to specify which files to use when constructing the scan, if the scan should be enhanced to include additional metadata columns, and allows a projection to be specified. It also retains previous functionality of pruning based on the provided filter when files to scan are not provided. `DeltaScanConfig` is also introduced which allows users to deterministic obtain the names of any added metadata columns or allows them to specify the name if required. The public interface for `find_files` has changed but functionality remains the same. A new table provider was introduced which accepts an `DeltaScanConfig`. This is required for future merge enhancements so unmodified files can be pruned pruned prior to writes. --------- Co-authored-by: Robert Pack <[email protected]> * chore: update datafusion (delta-io#1741) Updates arrow and datafusion dependencies to latest. * docs: convert docs to use mkdocs (delta-io#1731) # Description Completed the outstanding tasks in delta-io#1708 Also changed theme from readthedocs to mkdocs - both are built-in but latter looks sleeker # Related Issue(s) closes delta-io#1708 --------- Co-authored-by: Robert Pack <[email protected]> Co-authored-by: R. Tyler Croy <[email protected]> * docs: dynamodb lock configuration (delta-io#1752) # Description I have added documentation in the API and also on the Python usage page regarding this configuration. Please let me know if it is satisfactory, and if not, I am more than happy to address any issues or make any necessary adjustments. 
# Related Issue(s) - closes delta-io#1674 # Documentation * feat: ignore binary columns for stats generation * feat: honor appendOnly table config (delta-io#1747) # Description Throw an error if a transaction includes Remove action with data change but the Delta Table is append-only. # Related Issue(s) - closes delta-io#352 * chore: fix building/running tests without the datafusion feature This looks like an oversight that our CI didn't test because we have the datafusion feature typically enabled for our tests. The build error would only show up when building tests without it. * add write support explicitly for pyarrow dataset * feat(python): expose FSCK (repair) operation (delta-io#1730) # Description This PR exposes the FSCK operation as a `repair` method under the `DeltaTable `class. # Related Issue(s) <!--- For example: - closes delta-io#106 ---> - closes delta-io#1727 --------- Co-authored-by: Will Jones <[email protected]> * refactor: perform bulk deletes during metadata cleanup In addition to doing bulk deletes, I removed what seems like (at least to me) unnecessary code. At it's core, files are considered up for deletion when their last_modified time is older than the cutoff time AND the version if less than the specific version (usually the latest version). 
* Make an attempt at improving the utilization of delete_stream for cleaning up expired logs This change builds on @cmackenzie1's work and feeds the list stream directly into the delete_stream with a predicate function to identify paths for deletion * start to add vacuum into transaction log * add vacuum operations in transaction log * attempt to calculate size * add test * chore: bump Python package version * fix: ignore inf in stats * doc(README): remove typo * enhance docs to enable multi-lingual examples * use official Python API for references * chore: refactor into the deltalake meta crate and deltalake-core crates This puts the groundwork in place for starting to partition into smaller crates in a simpler and more manageable fashion. See delta-io#1713 * Correct the working directory for the parquet2 tests * feat: add deltalake sql crate (delta-io#1757) # Description This is an fairly early draft to create logical plans from sql using the datafusion abstractions. Adopted the patterns over there quite closely since the ultimate goal would be to ask the datafusion community if they would accept these changes within the datafusion sql crate ... # Related Issue(s) <!--- For example: - closes delta-io#106 ---> # Documentation <!--- Share links to useful documentation ---> --------- Co-authored-by: R. Tyler Croy <[email protected]> * rollback resolve bucket region change --------- Co-authored-by: Robert Pack <[email protected]> Co-authored-by: Robert Pack <[email protected]> Co-authored-by: nohajc <[email protected]> Co-authored-by: Will Jones <[email protected]> Co-authored-by: R. Tyler Croy <[email protected]> Co-authored-by: Denny Lee <[email protected]> Co-authored-by: QP Hou <[email protected]> Co-authored-by: haruband <[email protected]> Co-authored-by: Ben Magee <[email protected]> Co-authored-by: Constantin S. 
Pan <[email protected]> Co-authored-by: Eero Lihavainen <[email protected]> Co-authored-by: Mohammed Muddassir <[email protected]> Co-authored-by: Christopher Watford <[email protected]> Co-authored-by: Simon Vandel Sillesen <[email protected]> Co-authored-by: Ion Koutsouris <[email protected]> Co-authored-by: Matthew Powers <[email protected]> Co-authored-by: Sébastien Diemer <[email protected]> Co-authored-by: Cory Grinstead <[email protected]> Co-authored-by: Trinity Xia <[email protected]> Co-authored-by: hnaoto <[email protected]> Co-authored-by: universalmind303 <[email protected]> Co-authored-by: David Blajda <[email protected]> Co-authored-by: Josiah Parry <[email protected]> Co-authored-by: Guilhem de Viry <[email protected]> Co-authored-by: Nikolay Ulmasov <[email protected]> Co-authored-by: Cole Mackenzie <[email protected]> Co-authored-by: ldacey <[email protected]> Co-authored-by: Dave Hirschfeld <[email protected]> Co-authored-by: David Blajda <[email protected]> Co-authored-by: Brayan Jules <[email protected]> Co-authored-by: emcake <[email protected]> Co-authored-by: Junjun Dong <[email protected]> Co-authored-by: Ion Koutsouris <[email protected]> Co-authored-by: Deep145757 <[email protected]>
Description
Recently implemented operations did not use `DeltaScan` because it had some gaps. These gaps would make it harder to switch to logical plans, which is required for merge.

Gaps:
- It was not possible to include file lineage in the result.
- The subset of files to be scanned is known ahead of time, but users had to reconstruct a parquet scan based on those files.

The PR introduces a `DeltaScanBuilder` that allows users to specify which files to use when constructing the scan, whether the scan should be enhanced to include additional metadata columns, and which projection to apply. It also retains the previous functionality of pruning based on the provided filter when files to scan are not provided.

`DeltaScanConfig` is also introduced, which allows users to deterministically obtain the names of any added metadata columns, or to specify the names if required.

The public interface for `find_files` has changed, but functionality remains the same.

A new table provider was introduced which accepts a `DeltaScanConfig`. This is required for future merge enhancements so unmodified files can be pruned prior to writes.
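
To make the intended behaviour easier to picture, here is a minimal, dependency-free sketch of the builder pattern described above. It is a toy model, not the delta-rs API: `FileEntry`, `ScanConfig`, `ScanPlan`, `ScanBuilder`, and all of their methods and fields are hypothetical names chosen for illustration. It assumes an explicit file list takes precedence over filter-based pruning, and that the file lineage column is appended after the projected columns.

```rust
// Toy model only: every type and method here is hypothetical and is NOT
// the delta-rs API. It illustrates the behaviour described in the PR text.

/// A data file tracked by the table, with the value of one partition column.
#[derive(Debug, Clone, PartialEq)]
pub struct FileEntry {
    pub path: String,
    pub partition: String,
}

/// Scan settings: lets the caller choose the name of the lineage column.
#[derive(Debug, Clone, Default)]
pub struct ScanConfig {
    /// Name of the extra column carrying the source file path, if requested.
    pub file_column_name: Option<String>,
}

/// The result of building a scan: which files to read and which columns to emit.
#[derive(Debug, PartialEq)]
pub struct ScanPlan {
    pub files: Vec<String>,
    pub columns: Vec<String>,
}

pub struct ScanBuilder<'a> {
    table_files: &'a [FileEntry],
    schema: &'a [&'a str],
    files: Option<Vec<FileEntry>>,
    partition_filter: Option<String>,
    projection: Option<Vec<usize>>,
    config: ScanConfig,
}

impl<'a> ScanBuilder<'a> {
    pub fn new(table_files: &'a [FileEntry], schema: &'a [&'a str]) -> Self {
        ScanBuilder {
            table_files,
            schema,
            files: None,
            partition_filter: None,
            projection: None,
            config: ScanConfig::default(),
        }
    }

    /// Scan exactly these files (assumption: this bypasses filter pruning).
    pub fn with_files(mut self, files: Vec<FileEntry>) -> Self {
        self.files = Some(files);
        self
    }

    /// Keep only files whose partition value equals `value`.
    pub fn with_partition_filter(mut self, value: &str) -> Self {
        self.partition_filter = Some(value.to_string());
        self
    }

    /// Select columns by their index in the schema.
    pub fn with_projection(mut self, indices: Vec<usize>) -> Self {
        self.projection = Some(indices);
        self
    }

    pub fn with_config(mut self, config: ScanConfig) -> Self {
        self.config = config;
        self
    }

    pub fn build(self) -> ScanPlan {
        let ScanBuilder { table_files, schema, files, partition_filter, projection, config } = self;

        // An explicit file list wins; otherwise prune the table's files with the filter.
        let files: Vec<String> = match files {
            Some(explicit) => explicit.into_iter().map(|f| f.path).collect(),
            None => table_files
                .iter()
                .filter(|f| partition_filter.as_ref().map_or(true, |p| &f.partition == p))
                .map(|f| f.path.clone())
                .collect(),
        };

        // Apply the projection, then append the lineage column if one was requested.
        let mut columns: Vec<String> = match projection {
            Some(indices) => indices.iter().map(|&i| schema[i].to_string()).collect(),
            None => schema.iter().map(|c| c.to_string()).collect(),
        };
        if let Some(name) = config.file_column_name {
            columns.push(name);
        }

        ScanPlan { files, columns }
    }
}
```

In this sketch, calling `with_partition_filter` alone prunes the table's file list, while calling `with_files` hands the builder a list that was computed ahead of time, which is the case the second gap above refers to. Because the caller sets `file_column_name`, the name of the lineage column is known up front rather than guessed.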